Data exploration

Introduction

In a well-defined study, exploratory analysis may be largely unnecessary. Consider a scenario where the client has identified the questions or hypotheses of interest, along with the additional variables that may predict or mediate the identified outcomes. An activity partner such as MSI has been given the time and resources to consult with stakeholders, scope out the target environment to identify and map the data generating process, and develop an inferential design by which the collected data and analytical routines address the questions posed by the client. In this scenario, exploratory analysis may be eliminated entirely, and the time that would have been spent on exploration shifts to sensitivity analysis of the pre-identified analytical routines.

On the other hand, consider a scenario where data were collected for one purpose and are later recognized as potentially useful for other, not yet identified, purposes. In this case, the initial analysis is entirely exploratory.

Data visualization

Data visualization tactics

Incorporating labels into lines or other geoms (geomtextpath)

MSI is often tasked with ongoing data collection and evaluation of a client activity. For example, the MENA MELS activity (2020-2024) was tasked with ongoing monitoring and evaluation of the Middle East Partnership for Peace Activity (MEPPA). MEPPA comprised grants to several local partners organized around the common motif of building bonds between different demographic groups. The primary outcome of interest was whether a participant in the grant activity reported a perceived increase in understanding the political, social, and economic situation and viewpoints of another group. Data were collected on a rolling basis from several implementing partners, at baseline and endline. Those data are summarized as follows:

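library(tidyverse) # read_csv(), filter(), ggplot() and the pipe
library(flextable) # formatted tables

df1 <- read_csv("data/short demo series/meppa response items.csv",
                show_col_types=FALSE) # suppress the column-type message
df1 %>%
  flextable() %>% 
  autofit()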

| item | response | lab | endline | n | perc |
|---|---|---|---|---|---|
| Political views | 1 | Basic understanding | 0 | 22 | 0.286 |
| Political views | 2 | Fair understanding | 0 | 38 | 0.494 |
| Political views | 3 | High understanding | 0 | 17 | 0.221 |
| Political views | 1 | Basic understanding | 1 | 18 | 0.173 |
| Political views | 2 | Fair understanding | 1 | 50 | 0.481 |
| Political views | 3 | High understanding | 1 | 36 | 0.346 |
| Social views | 1 | Basic understanding | 0 | 30 | 0.390 |
| Social views | 2 | Fair understanding | 0 | 25 | 0.325 |
| Social views | 3 | High understanding | 0 | 22 | 0.286 |
| Social views | 1 | Basic understanding | 1 | 15 | 0.147 |
| Social views | 2 | Fair understanding | 1 | 52 | 0.510 |
| Social views | 3 | High understanding | 1 | 35 | 0.343 |
| Economic views | 1 | Basic understanding | 0 | 32 | 0.416 |
| Economic views | 2 | Fair understanding | 0 | 30 | 0.390 |
| Economic views | 3 | High understanding | 0 | 15 | 0.195 |
| Economic views | 1 | Basic understanding | 1 | 20 | 0.196 |
| Economic views | 2 | Fair understanding | 1 | 55 | 0.539 |
| Economic views | 3 | High understanding | 1 | 27 | 0.265 |

For purposes of client reporting, there are three ordinal responses, at baseline and endline, for each of three types of viewpoint. There are too few observations to conduct inferential tests at this level of granularity, but the data may be visualized descriptively.

The geomtextpath package offers the functionality to directly label line-based plots with text that is able to follow a curved path. Simply replace ‘geom_line’ with ‘geom_textpath’ and assign the variable to use as the label. The following figures illustrate.

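library(geomtextpath) # geom_textpath()
library(scales) # percent_format()

pol <- ggplot(filter(df1, item=="Political views"), aes(endline, perc, color=as.factor(response))) + 
  geom_textpath(aes(label=lab), # write each response label along its own line
                size=4) +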
  geom_label(aes(label=paste(round(perc*100,0), "%", sep="")),
             size=4,
             label.padding = unit(.14, "lines")) +
  scale_color_viridis_d(option="D") +
  scale_x_continuous(limits=c(-.1, 1.1),
                     breaks=c(0,1),
                     labels=c("Baseline","Endline")) +
  scale_y_continuous(labels=percent_format(accuracy=1)) +
  theme(legend.position="none",
        axis.text.y=element_blank()) +
  labs(x="",
       y="",
       title="Political situation") 

pol

soc <- ggplot(filter(df1, item=="Social views"), aes(endline, perc, color=as.factor(response))) + 
  geom_textpath(aes(label=lab),
                size=4) +
  geom_label(aes(label=paste(round(perc*100,0), "%", sep="")),
             size=4,
             label.padding = unit(.14, "lines")) +
  scale_color_viridis_d(option="D") +
  scale_x_continuous(limits=c(-.1, 1.1),
                     breaks=c(0,1),
                     labels=c("Baseline","Endline")) +
  scale_y_continuous(labels=percent_format(accuracy=1)) +
  theme(legend.position="none",
        axis.text.y=element_blank()) +
  labs(x="",
       y="",
       title="Social situation") 

soc

ec <- ggplot(filter(df1, item=="Economic views"), aes(endline, perc, color=as.factor(response))) + 
  geom_textpath(aes(label=lab),
                size=4) +
  geom_label(aes(label=paste(round(perc*100,0), "%", sep="")),
             size=4,
             label.padding = unit(.14, "lines")) +
  scale_color_viridis_d(option="D") +
  scale_x_continuous(limits=c(-.1, 1.1),
                     breaks=c(0,1),
                     labels=c("Baseline","Endline")) +
  scale_y_continuous(labels=percent_format(accuracy=1)) +
  theme(legend.position="none",
        axis.text.y=element_blank()) +
  labs(x="",
       y="",
       title="Economic situation") 

ec
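The three plots above differ only in the item that is filtered and the title. As a hedged sketch, the shared code could be wrapped in a small helper; `plot_item()` and the `demo` data frame are hypothetical names introduced here for illustration, and the demonstration rows are copied from the "Political views" rows of the table above.

```r
library(dplyr)
library(ggplot2)
library(geomtextpath)
library(scales)

# Hypothetical helper: one labelled baseline/endline plot for a single item.
plot_item <- function(data, item_name, title) {
  ggplot(filter(data, item == item_name),
         aes(endline, perc, color = as.factor(response))) +
    geom_textpath(aes(label = lab), size = 4) +
    geom_label(aes(label = paste0(round(perc * 100, 0), "%")),
               size = 4,
               label.padding = unit(.14, "lines")) +
    scale_color_viridis_d(option = "D") +
    scale_x_continuous(limits = c(-.1, 1.1),
                       breaks = c(0, 1),
                       labels = c("Baseline", "Endline")) +
    scale_y_continuous(labels = percent_format(accuracy = 1)) +
    theme(legend.position = "none",
          axis.text.y = element_blank()) +
    labs(x = "", y = "", title = title)
}

# Demonstration data: the six "Political views" rows from the table above.
demo <- tibble::tribble(
  ~item,             ~response, ~lab,                  ~endline, ~perc,
  "Political views", 1,         "Basic understanding", 0,        0.286,
  "Political views", 2,         "Fair understanding",  0,        0.494,
  "Political views", 3,         "High understanding",  0,        0.221,
  "Political views", 1,         "Basic understanding", 1,        0.173,
  "Political views", 2,         "Fair understanding",  1,        0.481,
  "Political views", 3,         "High understanding",  1,        0.346
)

pol_demo <- plot_item(demo, "Political views", "Political situation")
```

With the full data, the same helper would be called once per item, for example `plot_item(df1, "Social views", "Social situation")`.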

Given that the three types of understanding of others’ situation are highly correlated, it makes sense to present these measures compactly as aspects of a deeper underlying construct. The patchwork package allows multiple ggplots to be assembled into a single plot. The following figure illustrates.

library(patchwork) # lets `+` combine separate ggplots side by side

pol + soc + ec + 
  plot_annotation(title="How well do you understand the situation of others?")

A final way to present the data more compactly is to collapse the ordinal responses to binary and collect the three measures as lines in a single plot. The following data capture each type of understanding with fair or high understanding as one category and basic understanding as the other.

dat <- read_csv("data/short demo series/meppa item ladder.csv",
                show_col_types=FALSE)

dat_flx <- dat %>%
  flextable() %>%
  autofit() 

dat_flx

| endline | n | perc | item |
|---|---|---|---|
| 0 | 55 | 0.714 | Political situation |
| 1 | 86 | 0.827 | Political situation |
| 0 | 47 | 0.610 | Social situation |
| 1 | 87 | 0.853 | Social situation |
| 0 | 45 | 0.584 | Economic situation |
| 1 | 82 | 0.804 | Economic situation |
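The collapsed summary can be reproduced from the three-category counts shown earlier. The following is a minimal sketch, assuming the binary file was built by adding the fair and high counts and dividing by the total for each item and time point; the object names `df1_demo` and `ladder` are hypothetical, and the counts are copied from the response table above.

```r
library(dplyr)

# Counts copied from the response table earlier in this section.
df1_demo <- tibble::tribble(
  ~item,            ~response, ~endline, ~n,
  "Political views", 1, 0, 22,
  "Political views", 2, 0, 38,
  "Political views", 3, 0, 17,
  "Political views", 1, 1, 18,
  "Political views", 2, 1, 50,
  "Political views", 3, 1, 36,
  "Social views",    1, 0, 30,
  "Social views",    2, 0, 25,
  "Social views",    3, 0, 22,
  "Social views",    1, 1, 15,
  "Social views",    2, 1, 52,
  "Social views",    3, 1, 35,
  "Economic views",  1, 0, 32,
  "Economic views",  2, 0, 30,
  "Economic views",  3, 0, 15,
  "Economic views",  1, 1, 20,
  "Economic views",  2, 1, 55,
  "Economic views",  3, 1, 27
)

# Collapse to binary: responses 2 (fair) and 3 (high) versus 1 (basic).
ladder <- df1_demo %>%
  group_by(item, endline) %>%
  summarise(n_fair_high = sum(n[response >= 2]), # fair + high counts
            n_total = sum(n),                    # all respondents
            .groups = "drop") %>%
  mutate(perc = round(n_fair_high / n_total, 3))

ladder
```

The resulting counts and proportions match the table above (for example, 55 of 77 political-views respondents at baseline, or 0.714). The sketch keeps the original item names ("Political views") rather than the "situation" labels used in the summary file.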

With this simplified data summary, the trendline for each type of understanding may now be collected in a single plot.

ggplot(dat, aes(endline, perc, color=as.factor(item))) + 
  geom_textpath(aes(label=item),
                size=4) +
  geom_label(aes(label=paste(round(perc*100,0), "%", sep="")),
             size=4,
             label.padding = unit(.14, "lines")) +
  scale_color_viridis_d(option="D") +
  scale_x_continuous(limits=c(-.1, 1.1),
                     breaks=c(0,1),
                     labels=c("Baseline","Endline")) +
  scale_y_continuous(labels=percent_format(accuracy=1),
                     breaks=c(.5,1),
                     sec.axis=dup_axis(breaks=c(.804,.827,.853),
                                       labels=c("+22","+12","+24"))) +
  theme(legend.position="none",
        axis.text.y.left=element_blank()) +
  labs(x="",
       y="",
       caption="Proportion reporting fair or high\nunderstanding of others' situation")

Note further the use of secondary axis breaks to illustrate the change score for each trendline from baseline to endline.
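The secondary-axis breaks and labels in the plot above are typed in by hand. As a hedged sketch, they could instead be computed from the data. This assumes the change scores are differences between the rounded percentages shown in the point labels (for example, 83 minus 71 for the political item); the object names `dat_demo` and `change` are hypothetical, and the values are copied from the summary table above.

```r
library(dplyr)

# Values copied from the binary summary table above.
dat_demo <- tibble::tribble(
  ~endline, ~n, ~perc, ~item,
  0, 55, 0.714, "Political situation",
  1, 86, 0.827, "Political situation",
  0, 47, 0.610, "Social situation",
  1, 87, 0.853, "Social situation",
  0, 45, 0.584, "Economic situation",
  1, 82, 0.804, "Economic situation"
)

change <- dat_demo %>%
  group_by(item) %>%
  summarise(
    endline_perc = perc[endline == 1], # where the label sits on the axis
    change = round(perc[endline == 1] * 100) - round(perc[endline == 0] * 100)
  ) %>%
  mutate(label = sprintf("%+d", as.integer(change))) %>% # signed label, e.g. "+22"
  arrange(endline_perc) # axis breaks should be in increasing order

change
```

The two columns could then be passed to the secondary axis as `dup_axis(breaks = change$endline_perc, labels = change$label)`, so the labels stay in step with the data if the file is updated.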

The R computing language allows for several ways to customize the use of labels in statistical or descriptive graphics. This short demo has illustrated MSI’s use of the geomtextpath package to place labels directly along the line or curve of a plot. This illustration used only straight lines between two points in time. For additional use cases of the geomtextpath package, see the package vignette.

Gantt charts

Sometimes it can be helpful to incorporate a more visual presentation of Gantt charts into our planning documents and client communications. The following table is an example of a simplified Gantt that was extracted from a larger Gantt for the purposes of identifying the specific areas that could be subject to monitoring or evaluation activities.

library(readxl) # read_excel()

gantt <- read_excel("data/short demo series/aqbe - GANTT.xlsx")

gantt %>%
  flextable()

| num | act | Activity | Cohort | Climate zone | Label | Label2 | Label3 | Label4 | Start | Finish |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1.1.2 | Professional development instruments and models | | | 35 Master Trainers and 7 Teacher Educators trained | 35 Master Trainers and 7 Teacher Educators trained | 35 Master Trainers | 35 | 2024-11-01 00:00:00 | 2025-01-01 00:00:00 |
| 2 | 1.1.4 | Teachers trained on UNICEF Package | 1 | Warm | Cohort 1, Year 1, Warm Climate | 400 CBE / 2,625 teachers trained | 400 CBE / 2,625 teachers | 3025 | 2024-05-01 00:00:00 | 2024-09-01 00:00:00 |
| 3 | 1.1.4 | Teachers trained on UNICEF Package | 1 | Cold | Cohort 1, Year 2, Cold Climate | 400 CBE / 2,625 teachers trained | 400 CBE / 2,625 teachers | 3025 | 2024-12-01 00:00:00 | 2025-04-01 00:00:00 |
| 4 | 1.1.4 | Teachers trained on UNICEF Package | 2 | Warm | Cohort 2, Year 3, Warm Climate | 400 CBE / 2,625 teachers trained | 400 CBE / 2,625 teachers | 3025 | 2026-05-01 00:00:00 | 2026-09-01 00:00:00 |
| 5 | 1.1.4 | Teachers trained on UNICEF Package | 2 | Cold | Cohort 2, Year 4, Cold Climate | 400 CBE / 2,625 teachers trained | 400 CBE / 2,625 teachers | 3025 | 2026-12-01 00:00:00 | 2027-04-01 00:00:00 |
| 6 | 1.1.8 | Teacher-learner materials | 1 | Warm | Cohort 1, Year 1, Warm Climate | 2,625 teachers / 52,500 students | 2,625 teachers / 52,500 students | 2,625 T / 52,500 S | 2024-05-01 00:00:00 | 2024-09-01 00:00:00 |
| 7 | 1.1.8 | Teacher-learner materials | 1 | Cold | Cohort 1, Year 2, Cold Climate | 2,625 teachers / 52,500 students | 2,625 teachers / 52,500 students | 2,625 T / 52,500 S | 2024-12-01 00:00:00 | 2025-04-01 00:00:00 |
| 8 | 1.1.8 | Teacher-learner materials | 2 | Warm | Cohort 2, Year 3, Warm Climate | 2,625 teachers / 52,500 students | 2,625 teachers / 52,500 students | 2,625 T / 52,500 S | 2026-05-01 00:00:00 | 2026-09-01 00:00:00 |
| 9 | 1.1.8 | Teacher-learner materials | 2 | Cold | Cohort 2, Year 4, Cold Climate | 2,625 teachers / 52,500 students | 2,625 teachers / 52,500 students | 2,625 T / 52,500 S | 2026-12-01 00:00:00 | 2027-04-01 00:00:00 |
| 10 | 1.2.1 | Targeted remediation | 1 | Warm | Cohort 1, Year 2 | 1,050 teachers / 10,500 students | 1,050 teachers / 10,500 students | 1,050 T / 10,500 S | 2024-10-01 00:00:00 | 2025-10-01 00:00:00 |
| 11 | 1.2.1 | Targeted remediation | 2 | Cold | Cohort 2, Year 4 | 1,050 teachers / 10,500 students | 1,050 teachers / 10,500 students | 1,050 T / 10,500 S | 2026-10-01 00:00:00 | 2027-10-01 00:00:00 |

To visualize this as a Gantt chart using the ggplot2 package in R, we first need to stack the dates in rows rather than columns. Note that R and ggplot2 usually prefer to work with data in long format (stacked rows) rather than wide format (variable values as columns).

gant2 <- gantt %>%
  pivot_longer(c(Start, Finish), # the date columns to pivot (columns 10 and 11)
               names_to="Type", # the column names (Start, Finish) go here
               values_to="Date") # the values of the Start and Finish variables go here

gant2 %>%
  flextable()

| num | act | Activity | Cohort | Climate zone | Label | Label2 | Label3 | Label4 | Type | Date |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1.1.2 | Professional development instruments and models | | | 35 Master Trainers and 7 Teacher Educators trained | 35 Master Trainers and 7 Teacher Educators trained | 35 Master Trainers | 35 | Start | 2024-11-01 00:00:00 |
| 1 | 1.1.2 | Professional development instruments and models | | | 35 Master Trainers and 7 Teacher Educators trained | 35 Master Trainers and 7 Teacher Educators trained | 35 Master Trainers | 35 | Finish | 2025-01-01 00:00:00 |
| 2 | 1.1.4 | Teachers trained on UNICEF Package | 1 | Warm | Cohort 1, Year 1, Warm Climate | 400 CBE / 2,625 teachers trained | 400 CBE / 2,625 teachers | 3025 | Start | 2024-05-01 00:00:00 |
| 2 | 1.1.4 | Teachers trained on UNICEF Package | 1 | Warm | Cohort 1, Year 1, Warm Climate | 400 CBE / 2,625 teachers trained | 400 CBE / 2,625 teachers | 3025 | Finish | 2024-09-01 00:00:00 |
| 3 | 1.1.4 | Teachers trained on UNICEF Package | 1 | Cold | Cohort 1, Year 2, Cold Climate | 400 CBE / 2,625 teachers trained | 400 CBE / 2,625 teachers | 3025 | Start | 2024-12-01 00:00:00 |
| 3 | 1.1.4 | Teachers trained on UNICEF Package | 1 | Cold | Cohort 1, Year 2, Cold Climate | 400 CBE / 2,625 teachers trained | 400 CBE / 2,625 teachers | 3025 | Finish | 2025-04-01 00:00:00 |
| 4 | 1.1.4 | Teachers trained on UNICEF Package | 2 | Warm | Cohort 2, Year 3, Warm Climate | 400 CBE / 2,625 teachers trained | 400 CBE / 2,625 teachers | 3025 | Start | 2026-05-01 00:00:00 |
| 4 | 1.1.4 | Teachers trained on UNICEF Package | 2 | Warm | Cohort 2, Year 3, Warm Climate | 400 CBE / 2,625 teachers trained | 400 CBE / 2,625 teachers | 3025 | Finish | 2026-09-01 00:00:00 |
| 5 | 1.1.4 | Teachers trained on UNICEF Package | 2 | Cold | Cohort 2, Year 4, Cold Climate | 400 CBE / 2,625 teachers trained | 400 CBE / 2,625 teachers | 3025 | Start | 2026-12-01 00:00:00 |
| 5 | 1.1.4 | Teachers trained on UNICEF Package | 2 | Cold | Cohort 2, Year 4, Cold Climate | 400 CBE / 2,625 teachers trained | 400 CBE / 2,625 teachers | 3025 | Finish | 2027-04-01 00:00:00 |
| 6 | 1.1.8 | Teacher-learner materials | 1 | Warm | Cohort 1, Year 1, Warm Climate | 2,625 teachers / 52,500 students | 2,625 teachers / 52,500 students | 2,625 T / 52,500 S | Start | 2024-05-01 00:00:00 |
| 6 | 1.1.8 | Teacher-learner materials | 1 | Warm | Cohort 1, Year 1, Warm Climate | 2,625 teachers / 52,500 students | 2,625 teachers / 52,500 students | 2,625 T / 52,500 S | Finish | 2024-09-01 00:00:00 |
| 7 | 1.1.8 | Teacher-learner materials | 1 | Cold | Cohort 1, Year 2, Cold Climate | 2,625 teachers / 52,500 students | 2,625 teachers / 52,500 students | 2,625 T / 52,500 S | Start | 2024-12-01 00:00:00 |
| 7 | 1.1.8 | Teacher-learner materials | 1 | Cold | Cohort 1, Year 2, Cold Climate | 2,625 teachers / 52,500 students | 2,625 teachers / 52,500 students | 2,625 T / 52,500 S | Finish | 2025-04-01 00:00:00 |
| 8 | 1.1.8 | Teacher-learner materials | 2 | Warm | Cohort 2, Year 3, Warm Climate | 2,625 teachers / 52,500 students | 2,625 teachers / 52,500 students | 2,625 T / 52,500 S | Start | 2026-05-01 00:00:00 |
| 8 | 1.1.8 | Teacher-learner materials | 2 | Warm | Cohort 2, Year 3, Warm Climate | 2,625 teachers / 52,500 students | 2,625 teachers / 52,500 students | 2,625 T / 52,500 S | Finish | 2026-09-01 00:00:00 |
| 9 | 1.1.8 | Teacher-learner materials | 2 | Cold | Cohort 2, Year 4, Cold Climate | 2,625 teachers / 52,500 students | 2,625 teachers / 52,500 students | 2,625 T / 52,500 S | Start | 2026-12-01 00:00:00 |
| 9 | 1.1.8 | Teacher-learner materials | 2 | Cold | Cohort 2, Year 4, Cold Climate | 2,625 teachers / 52,500 students | 2,625 teachers / 52,500 students | 2,625 T / 52,500 S | Finish | 2027-04-01 00:00:00 |
| 10 | 1.2.1 | Targeted remediation | 1 | Warm | Cohort 1, Year 2 | 1,050 teachers / 10,500 students | 1,050 teachers / 10,500 students | 1,050 T / 10,500 S | Start | 2024-10-01 00:00:00 |
| 10 | 1.2.1 | Targeted remediation | 1 | Warm | Cohort 1, Year 2 | 1,050 teachers / 10,500 students | 1,050 teachers / 10,500 students | 1,050 T / 10,500 S | Finish | 2025-10-01 00:00:00 |
| 11 | 1.2.1 | Targeted remediation | 2 | Cold | Cohort 2, Year 4 | 1,050 teachers / 10,500 students | 1,050 teachers / 10,500 students | 1,050 T / 10,500 S | Start | 2026-10-01 00:00:00 |
| 11 | 1.2.1 | Targeted remediation | 2 | Cold | Cohort 2, Year 4 | 1,050 teachers / 10,500 students | 1,050 teachers / 10,500 students | 1,050 T / 10,500 S | Finish | 2027-10-01 00:00:00 |
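To see the wide-to-long reshaping in isolation, here is a minimal sketch on a made-up schedule. The `wide` table and its two activities are hypothetical and are not drawn from the Gantt file; only the `pivot_longer()` call mirrors the step used above.

```r
library(tibble)
library(tidyr)

# Hypothetical schedule in wide format: one row per activity,
# with the two dates held in separate columns.
wide <- tibble(
  num = 1:2,
  Activity = c("Training", "Materials"),
  Start = as.Date(c("2024-05-01", "2024-12-01")),
  Finish = as.Date(c("2024-09-01", "2025-04-01"))
)

# Long format: two rows per activity, one for each date.
long <- pivot_longer(wide,
                     c(Start, Finish),   # the date columns to stack
                     names_to = "Type",  # holds "Start" or "Finish"
                     values_to = "Date") # holds the date itself

long
```

Each activity now appears twice, which is what lets a line geom connect its start date to its finish date.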

Now we can plot the dates on the x-axis and the activity on the y-axis (ordered by the index variable). Our geom will be a line, and an additional aesthetic will be to color the lines by one of our labels.

ggplot(gant2, # the data to use 
       aes(Date, # x aesthetic
           fct_reorder(Activity,num), # order the Activity aesthetic according to the num variable 
           color=Label)) + # add an additional aesthetic
  geom_line(linewidth=5) + # for every combination of Activity and Label, draw a line between the dates
  scale_x_datetime(limits=as.POSIXct(c("2024-02-01", "2027-11-01"), tz="UTC"), # range of x-axis; must cover the latest Finish date (2027-10-01) or that row is dropped
                   date_breaks="6 months", # how far apart to set the tick marks
                   date_labels="%b-%y") + # format the tick labels to Mon-YY
  scale_y_discrete(labels=label_wrap(25)) + # wraps long y-axis labels at about 25 characters (label_wrap() is from scales)
  scale_color_viridis_d() + # use a color-blind friendly palette
  labs(x="", # no label
       y="", # no label
       title="Illustrative Gantt chart") + # figure title
  theme(legend.position="none", # remove legend to declutter
        plot.background = element_rect(fill = "aliceblue"), 
        panel.background = element_rect(fill = "aliceblue"))

This Gantt is more visually appealing, and still retains utility in highlighting what activities occur and when. Note that we have suppressed the legend created by the color aesthetic, but we could have kept it in order to communicate additional information.

Additional resources:

Simple Gantt charts in R with ggplot2 … and Microsoft Excel

Gantt charts using ggplott and Plotly

R Shiny Gantt Chart

How to Create a Gantt Chart in R Using ggplot2

Happy Gantting!

Mapping

Much of our data is collected from surveys and has geographic coordinates associated with a point of collection, a house, a city, etc. It can be useful to generate a map to get a sense of where the data is coming from. Mapping is a part of MSI’s exploratory data analysis process and is also used in developing a sampling plan and in producing high quality data visualizations for clients. This section provides a brief introduction to mapping in R.

The most commonly used packages for handling spatial data are sf for vectors, terra for vectors and rasters, and raster for rasters. Two frequently used packages for visualizing the data are tmap and ggplot2.

To get started, we need to load our packages.

#run this line first if you have never used these packages before
#install.packages(c("tidyverse", "sf", "tmap", "readr", "here", "geodata", "ggrepel", "ggspatial", "DT"))

library(tidyverse) #load the core tidyverse packages including ggplot2
library(sf) #provides tools to work with vector data 
library(tmap) #for visualizing spatial data
library(readr) #functions for reading external datasets 
library(here) #to better locate files not in working directory
library(geodata) #to download administrative boundaries

Read in the data

#It is a csv file so I use the read_csv function and provide the file path
cities <- read_csv(here::here("../methods corner/Map demo/data/Madagascar_Cities.csv")
                   , show_col_types = FALSE)

#Observe the first few rows of data
DT::datatable(head(cities))

Now, for the administrative boundaries. The boundary is read in using gadm() from the geodata package and then converted to an sf object with the st_as_sf() function. In the example, we download only the country boundary (level 0), but if we wanted regions or departments, we would simply change the level argument inside the gadm() function call to 1 or 2.

Country boundary

#This is only the country boundary. 
mdg <- geodata::gadm(country = "MDG"
                  , level = 0
                  , path = tempdir()) |>
  st_as_sf()

Convert the cities to an sf object

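Remember that the cities object was read from a standard .csv with longitude and latitude columns, so it is not yet recognized as an sf object with geographic properties. Here is how to convert it to an sf object with a single geometry column and a coordinate reference system (CRS). The value 4326 is the EPSG code for WGS 84, the usual CRS for longitude and latitude in decimal degrees.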

cities_sf <- cities |>
  st_as_sf(coords = c("Longitude", "Latitude")
           , crs = 4326)

#observe the first few rows of data
DT::datatable(head(cities_sf))
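One detail that is easy to get wrong in the conversion is the order of the `coords` argument: `st_as_sf()` expects the x column (longitude) first and the y column (latitude) second. The following minimal sketch shows this on a stand-in table; `cities_demo` is hypothetical and its coordinates are approximate values entered by hand, not taken from the Madagascar_Cities.csv file.

```r
library(sf)

# Hypothetical stand-in for the cities table; coordinates are approximate.
cities_demo <- data.frame(
  Name = c("Antananarivo", "Toamasina"),
  Longitude = c(47.52, 49.40),
  Latitude = c(-18.88, -18.15)
)

# coords is given as c(x, y), i.e. longitude first, then latitude.
cities_demo_sf <- st_as_sf(cities_demo,
                           coords = c("Longitude", "Latitude"),
                           crs = 4326)

st_crs(cities_demo_sf)$epsg     # the EPSG code attached to the object
st_coordinates(cities_demo_sf)  # X holds longitude, Y holds latitude
names(cities_demo_sf)           # the two coordinate columns become one geometry column
```

Swapping the two column names would still run without error but would place the points at the wrong location, so a quick plot against the country boundary is a useful check.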

Make the map

The following code chunks walk through the process of making and improving a map, first in tmap and then in ggplot2. In the example, cities are what we are plotting, but we could be plotting any variable of a dataset.

tmap_mode("plot") #static maps; set the mode as its own statement
tm_shape(mdg) +
  tm_polygons() + #for only the borders, use tm_borders()
  tm_shape(cities_sf) +
  tm_dots(size = .25, col = "red")

ggplot2::ggplot(mdg) +
  geom_sf() +
  geom_sf(data = cities_sf, color = "red")

Make the map better

#the city names are long so we have to 
# make a bigger window to fit them. This isn't part of the normal process
#make an object with the current bounding box
bbox_new <- st_bbox(mdg)

#calculate the x and y ranges of the bbox
xrange <- bbox_new$xmax - bbox_new$xmin # range of x values
yrange <- bbox_new$ymax - bbox_new$ymin # range of y values

#provide the new values for the 4 corners of the bbox
  bbox_new[1] <- bbox_new[1] - (0.7 * xrange) # xmin - left
  bbox_new[3] <- bbox_new[3] + (0.75 * xrange) # xmax - right
  bbox_new[2] <- bbox_new[2] - (0.1 * yrange) # ymin - bottom
  bbox_new[4] <- bbox_new[4] + (0.1 * yrange) # ymax - top

#convert the bbox to a sf collection (sfc)
bbox_new <- bbox_new |>  # take the bounding box ...
  st_as_sfc() # ... and make it a sf polygon

#now plot the map
tmap_mode("plot") #static maps; set the mode as its own statement
tm_shape(mdg, bbox = bbox_new) +
  tm_polygons() +
  tm_shape(cities_sf) +
  tm_dots(size = .25, col = "red") +
  tm_text(text = "Name", auto.placement = TRUE) +
  tm_layout(title = "Main Cities of\nMadagascar")
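The bounding-box expansion is written out in full each time it is needed in this section. As a hedged sketch, the same arithmetic could be wrapped in a small helper; `expand_bbox()` is a hypothetical function introduced here, not part of sf, and the demonstration geometry is made up.

```r
library(sf)

# Hypothetical helper: widen the bounding box of `x` by a fraction of its
# width (left, right) and height (bottom, top); returns an sfc polygon.
expand_bbox <- function(x, left = 0, right = 0, bottom = 0, top = 0) {
  bb <- as.numeric(st_bbox(x)) # order: xmin, ymin, xmax, ymax
  xrange <- bb[3] - bb[1]      # range of x values
  yrange <- bb[4] - bb[2]      # range of y values
  new_bb <- st_bbox(
    c(xmin = bb[1] - left * xrange,
      ymin = bb[2] - bottom * yrange,
      xmax = bb[3] + right * xrange,
      ymax = bb[4] + top * yrange),
    crs = st_crs(x)
  )
  st_as_sfc(new_bb) # convert the bbox to an sf polygon
}

# Made-up geometry spanning 0-10 in x and 0-20 in y.
demo_geom <- st_sfc(st_point(c(0, 0)), st_point(c(10, 20)), crs = 4326)
demo_box <- expand_bbox(demo_geom, left = 0.5, right = 0.5,
                        bottom = 0.1, top = 0.1)
st_bbox(demo_box)
```

With the country boundary, a call such as `expand_bbox(mdg, left = 0.7, right = 0.75, bottom = 0.1, top = 0.1)` would be expected to give the same window as the step-by-step code above.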

#the city names are long so we have to 
# make a bigger window to fit them. This isn't part of the normal process
#make an object with the current bounding box
bbox_new <- st_bbox(mdg)

#calculate the x and y ranges of the bbox
xrange <- bbox_new$xmax - bbox_new$xmin # range of x values
yrange <- bbox_new$ymax - bbox_new$ymin # range of y values

#provide the new values for the 4 corners of the bbox
  bbox_new[1] <- bbox_new[1] - (0.5 * xrange) # xmin - left
  bbox_new[3] <- bbox_new[3] + (0.5 * xrange) # xmax - right
  bbox_new[2] <- bbox_new[2] - (0.1 * yrange) # ymin - bottom
  bbox_new[4] <- bbox_new[4] + (0.1 * yrange) # ymax - top

#convert the bbox to a sf collection (sfc)
bbox_new <- bbox_new |>  # take the bounding box
  st_as_sfc() # ... and make it a sf polygon


ggplot2::ggplot() +
  geom_sf(data = mdg) +
  geom_sf(data = cities_sf, color = "red") +
  ggrepel::geom_text_repel(data = cities_sf
               , aes(label = Name
                     , geometry = geometry)
               , stat = "sf_coordinates"
               , min.segment.length = 0) +
  coord_sf(xlim = st_coordinates(bbox_new)[c(1,2),1], # min & max of x values
           ylim = st_coordinates(bbox_new)[c(2,3),2]) + # min & max of y values
  labs(title = "Main Cities of\nMadagascar") +
  theme_void()

Final touches

Now that we have a map with cities plotted (we achieved our goal!), we will add a few finishing touches and set the size of the city points to the population variable in the original dataset.

Additionally, tmap provides a simple interface to go from a static map to an interactive map simply by changing tmap_mode("plot") to tmap_mode("view").

#the city names are long so we have to 
# make a bigger window to fit them. This isn't part of the normal process
#make an object with the current bounding box
bbox_new <- st_bbox(mdg)

#calculate the x and y ranges of the bbox
xrange <- bbox_new$xmax - bbox_new$xmin # range of x values
yrange <- bbox_new$ymax - bbox_new$ymin # range of y values

#provide the new values for the 4 corners of the bbox
  bbox_new[1] <- bbox_new[1] - (0.7 * xrange) # xmin - left
  bbox_new[3] <- bbox_new[3] + (0.75 * xrange) # xmax - right
  bbox_new[2] <- bbox_new[2] - (0.1 * yrange) # ymin - bottom
  bbox_new[4] <- bbox_new[4] + (0.1 * yrange) # ymax - top

#convert the bbox to a sf collection (sfc)
bbox_new <- bbox_new |>  # take the bounding box ...
  st_as_sfc() # ... and make it a sf polygon

#now plot the map, sizing each point by population
tmap_mode("plot") #static maps; set the mode as its own statement
tm_shape(mdg, bbox = bbox_new) +
  tm_polygons() +
  tm_shape(cities_sf) +
  tm_dots(size = "Population", col = "red"
          , legend.size.is.portrait = TRUE) +
  tm_text(text = "Name", auto.placement = TRUE
          , along.lines = TRUE) +
  tm_scale_bar(position = c("left", "bottom"), width = 0.15) +
  tm_compass(type = "4star"
             , position = c("right", "bottom")
             , size = 2) +
  tm_layout(main.title = "Main Cities of Madagascar"
            , legend.outside = TRUE)

ggplot2::ggplot() +
  geom_sf(data = mdg) +
  geom_sf(data = cities_sf, aes(size = Population)
          , color = "red") +
  ggrepel::geom_text_repel(data = cities_sf
               , aes(label = Name
                     , geometry = geometry)
               , stat = "sf_coordinates"
               , min.segment.length = 0) +
  coord_sf(xlim = st_coordinates(bbox_new)[c(1,2),1], # min & max of x values
           ylim = st_coordinates(bbox_new)[c(2,3),2]) + # min & max of y values
  ggspatial::annotation_scale(location = "bl") +
  ggspatial::annotation_north_arrow(location = "br"
                                    , which_north = "true"
                                    , size = 1)+
  labs(title = "Main Cities of Madagascar") +
  theme_void()

tmap_mode("view") #interactive maps; set the mode as its own statement
tm_shape(mdg) +
  tm_borders() +
  tm_shape(cities_sf) +
  tm_dots(size = "Population", col = "red"
          , legend.size.is.portrait = TRUE) +
  tm_text(text = "Name", auto.placement = TRUE
          , along.lines = TRUE) +
  tm_scale_bar(position = c("left", "bottom"), width = 0.15) +
  tm_compass(type = "4star"
             , position = c("right", "bottom")
             , size = 2) +
  tm_layout(main.title = "Main Cities of Madagascar"
            , legend.outside = TRUE)

Additional Resources

For those interested in mapping in R (or QGIS), there are many free resources available online. A great starting point for R is the online textbook Geocomputation with R. If you would rather work in Python, Geocomputation with Python is a great resource.